Abstract

Spoken language recognition requires a series of signal processing
steps and learning algorithms to model the distinguishing
characteristics of different languages. In this paper, we present a
sparse discriminative feature learning framework for language
recognition. We use sparse coding, an unsupervised method, to compute
efficient representations for spectral features from a speech
utterance while learning basis vectors for language models. Unlike
existing approaches, we introduce a maximum a posteriori (MAP)
adaptation scheme that further optimizes the discriminative quality
of sparse-coded speech features. We empirically validate the
effectiveness of our approach using the NIST LRE 2015 dataset.

Index  Terms:  speech  recognition,  sparse  coding 

1.  Introduction 

Originally used to explain neuronal activations [1], sparse coding
has emerged as an effective means to discover underlying structures
of unknown data. High-level feature representations learned from
sparse coding have, in some cases, yielded the best performance for
discriminative tasks in computer vision. Yet, sparse coding of speech
features (or audio signals in general) has not been explored
extensively. In this paper, we investigate a discriminative learning
framework based on sparse coding for language recognition.

Language recognition is a systematic process of identifying the
spoken language in a speech utterance. Over the years, Gaussian
mixture models (GMMs) [2] and support vector machines (SVMs) [3] have
been crucial to building high-performance language identification
(LID) systems. More recently, the idea of total variability space or
i-vector [4] has been studied for LID. Motivated by the joint factor
analysis (JFA) approach [5] for speaker verification, i-vector
approaches are known to perform better than JFA.

Sparse coding has been previously applied to speaker and language
identification [6, 7, 8]. Despite much interest from the machine
learning community, there is surprisingly little work in sparse
coding for speech. In a classification pipeline for sparse coding, a
simple classifier such as a linear SVM is trained on the learned
sparse feature vectors and is known to perform on par with (or better
than) more complex nonlinear schemes (e.g., deep neural networks,
kernel SVM) [9]. One possible explanation is that sparse coding can
achieve a near-optimal approximation of a highly complicated
nonlinear relationship through local and piecewise linear functions.

We structure the rest of this paper as follows. In Section 2, we
present a background on sparse coding. Section 3 describes our sparse
coding-based approaches for language recognition. In particular, we
propose adaptive sparse coding (ASC), an enhancement to the
semi-supervised classification pipeline on vanilla sparse coding, and
discuss an online method for per-utterance dictionary adaptation. As
a result, we can significantly improve the discriminative quality of
sparse-coded speech features. In Section 4, we evaluate the proposed
approaches against an i-vector based benchmark pipeline developed by
Lincoln Laboratory and MIT on a subset of the NIST LRE 2015
comprising the Arabic and Chinese clusters. Section 5 concludes the
paper.

2.  Sparse  Coding  Background 

Sparse coding is an unsupervised method to learn an efficient
representation of data using a small number of basis vectors. It has
been used to discover higher-level features present in data from
unlabeled examples. Given an example $\mathbf{x} \in \mathbb{R}^N$,
sparse coding searches for a representation
$\mathbf{y} \in \mathbb{R}^K$ (i.e., the feature vector for
$\mathbf{x}$) while simultaneously updating the dictionary
$\mathbf{D} \in \mathbb{R}^{N \times K}$ of $K$ basis vectors by

$$\min_{\mathbf{D},\mathbf{y}} \|\mathbf{x} - \mathbf{D}\mathbf{y}\|_2^2 + \lambda \|\mathbf{y}\|_1 \quad \text{s.t.} \quad \|\mathbf{d}_i\|_2 \le 1, \ \forall i \qquad (1)$$

where $\mathbf{d}_i$ is the $i$th dictionary atom in $\mathbf{D}$,
and $\lambda$ is a regularization parameter that penalizes over the
$\ell_1$-norm, which induces a sparse solution. With $K > N$, sparse
coding typically trains an overcomplete dictionary. This makes the
sparse code $\mathbf{y}$ higher in dimension than $\mathbf{x}$, but
only $S \ll N$ elements in $\mathbf{y}$ are nonzero.

A more direct way to control sparsity is to regularize on the
$\ell_0$ pseudo-norm $\|\mathbf{y}\|_0$, describing the number of
nonzero elements in $\mathbf{y}$. However, it is known to be
intractable to compute the sparsest $\ell_0$ solution in general. The
approach in Eq. (1) is called least absolute shrinkage and selection
operator (LASSO) [10], a convex relaxation of the $\ell_0$ sparse
coding that induces sparse $\mathbf{y}$'s. We use least angle
regression (LARS) [11] to solve the LASSO problem. We also consider
orthogonal matching pursuit (OMP) [12], a greedy-$\ell_0$ sparse
coding algorithm that approximately solves the $\ell_0$ sparse coding
problem extremely fast by

$$\min_{\mathbf{D},\mathbf{y}} \|\mathbf{x} - \mathbf{D}\mathbf{y}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{y}\|_0 \le S. \qquad (2)$$

OMP finds at most an $S$-sparse $\mathbf{y}$ explicitly.
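
To make the greedy selection behind Eq. (2) concrete, the following is a minimal NumPy sketch of OMP for a fixed dictionary. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `omp`, the unit-norm columns of `D`, and the stopping tolerance `tol` are choices made here for the example.

```python
import numpy as np

def omp(D, x, S, tol=1e-12):
    """Greedy sparse code of x over a fixed dictionary D (N x K, unit-norm columns).

    Returns y with at most S nonzero entries, approximating Eq. (2) with D held fixed.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros(D.shape[1])
    support = []                       # indices of the atoms selected so far
    residual = x.copy()
    for _ in range(S):
        corr = np.abs(D.T @ residual)  # correlation of every atom with the residual
        if support:
            corr[support] = 0.0        # never pick the same atom twice
        k = int(np.argmax(corr))
        if corr[k] <= tol:             # residual already explained; stop early
            break
        support.append(k)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
        y[:] = 0.0
        y[support] = coef
    return y
```

Because each iteration adds exactly one atom, the sparsity bound S is enforced by construction, which is what makes the cost of OMP easy to predict compared with a path-following LASSO solver.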

3.  Our  Approach 

3.1.  Shifted  delta  cepstral  feature  extraction 

We use a spectral-based technique by Torres et al. [13, 14] to
process speech waveforms. Speech is analyzed with a Hamming window of
20-msec duration at a 10-msec frame rate. The windowed speech
waveforms pass through a mel-scale filterbank and RASTA filtering
with per-utterance normalization to zero mean and unit variance.
Using the 7-1-3-7 scheme, we calculate the shifted delta cepstral
(SDC) coefficients. Concatenating with static cepstra, the spectral
features extracted from speech form a 56-dimensional vector. Lastly,
we run energy-based speech activity detection to remove undesirable
background noise.

Figure 1: Vanilla sparse coding pipeline
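
The 7-1-3-7 scheme is commonly read as N-d-P-k: N = 7 cepstral coefficients per frame, delta spread d = 1, block shift P = 3, and k = 7 stacked delta blocks. Under that reading (an assumption here, since the text does not spell out the parameters), 7 static coefficients plus 7 x 7 deltas give the 56 dimensions. The sketch below stacks the deltas with NumPy; the function name `sdc` and the choice to drop edge frames that lack full context are illustrative.

```python
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Stack static cepstra with shifted delta cepstra (N-d-P-k scheme).

    cepstra: array of shape (T, >= N), one row of cepstral coefficients per frame.
    Returns an array of shape (T', N + N * k); frames without full context are dropped.
    """
    c = np.asarray(cepstra, dtype=float)[:, :N]
    T = c.shape[0]
    last = T - ((k - 1) * P + d)       # exclusive upper bound on usable frame indices
    feats = []
    for t in range(d, last):
        # i-th delta block: c(t + iP + d) - c(t + iP - d)
        blocks = [c[t + i * P + d] - c[t + i * P - d] for i in range(k)]
        feats.append(np.concatenate([c[t]] + blocks))
    return np.array(feats)
```

With the default arguments each output row has 7 + 49 = 56 entries, matching the feature dimension stated above.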

3.2.  Vanilla  sparse  coding 

The key rationale for sparse coding is to learn useful
representations by decomposing spectro-temporal features of speech
into a sparse linear combination of basis vectors in a dictionary
(also learned). Nonzeros in the computed sparse code quantify the
presence of specific basis vectors. By exploiting variation of the
nonzero locations and magnitudes, we can build a discriminative
pipeline for language recognition.

Figure 1 describes a baseline sparse coding approach for LID, which
we call “vanilla sparse coding (VSC).” VSC is a semi-supervised
approach. Assuming $L$ languages of interest
$C = \{l_1, \dots, l_L\}$, we perform sparse coding with an unbiased
mix of unlabeled speech examples from all languages to train a
dictionary $\mathbf{D} \in \mathbb{R}^{N \times K}$ during the
unsupervised phase. The trained dictionary represents universal
sparse modeling of the $L$ languages. That is, given an unknown
speech input $\mathbf{x} \in \mathbb{R}^N$, we can compute its sparse
representation $\mathbf{y} \in \mathbb{R}^K$ using $\mathbf{D}$. By
the sparse modeling assumption, $\mathbf{y}$ has only several nonzero
elements

$$\mathbf{x} \approx y_1 \mathbf{d}_1 + y_2 \mathbf{d}_2 + \dots + y_K \mathbf{d}_K,$$

where $y_j$ is the $j$th element in $\mathbf{y}$, and $\mathbf{d}_j$
the $j$th basis vector in $\mathbf{D}$. We use the notation
$\mathbf{X} = [\mathbf{x}^{(1)} \dots \mathbf{x}^{(n)}]$ for a batch
of $n$ unlabeled training examples, where
$\mathbf{x}^{(i)} \in \mathbb{R}^N$ is the $i$th example in the
batch. Optionally, $\mathbf{X}$ can be normalized and whitened before
sparse coding for better results.

The supervised phase uses a labeled dataset. Consider $m$ labeled
training examples in
$\mathbf{X}_l = [\mathbf{x}_l^{(1)} \dots \mathbf{x}_l^{(m)}]$. Now,
each example $\mathbf{x}_l^{(k)} = \{\mathbf{x}^{(k)}, l^{(k)}\}$
includes a language label $l^{(k)} \in C$ for $\mathbf{x}^{(k)}$.
Recall each $\mathbf{x}$ contains the spectral feature for a single
frame (i.e., 10 msec). Since a speech utterance is much longer (up to
minutes), sparse coding will result in too many feature vectors per
utterance. Before the supervised training of classifiers, we perform
pooling, a technique popularized in computer vision, across all
sparse codes from the same utterance. The purpose of pooling is
two-fold: 1) aggregation of feature vectors and 2) statistical
robustness.

3.3.  Enhancement:  adaptive  sparse  coding 

We propose an enhancement of VSC as illustrated in Figure 2. We name
the approach “adaptive sparse coding (ASC).” The unsupervised phase
of ASC is identical to VSC, and the dictionary $\mathbf{D}$ for
universal sparse modeling of all languages is first learned. The
basic idea of ASC is to adapt $\mathbf{D}$ to the utterance-dependent
dictionary $\mathbf{D}_a$ during the supervised phase. With both
$\mathbf{D}$ and $\mathbf{D}_a$, we can compute two sparse codes
$\mathbf{y}$ and $\mathbf{y}_a$ for each input vector $\mathbf{x}$
from the same utterance such that
$\mathbf{x} = \mathbf{D}\mathbf{y}$ and
$\mathbf{x} = \mathbf{D}_a\mathbf{y}_a$, respectively. ASC takes in
the difference $\Delta = \mathbf{y}_a - \mathbf{y}$ to train
classifiers (compared to $\mathbf{y}$ for VSC as in Figure 1). Note
that the $\Delta$ vectors from the same utterance are also pooled
before being applied to classifiers.

Figure 2: During the supervised phase of the adaptive sparse coding
pipeline, per-utterance dictionary adaptation is performed. The
difference vector between the adapted and universal sparse models is
used to train classifiers.

Our idea of adapting sparse coding dictionaries and forming a
discriminative $\Delta$ is analogous to adapted GMM-UBM and
supervectors [2, 15]. Consider a probabilistic model for sparse
coding under Gaussian noise

$$P(\mathbf{x} \mid \mathbf{D}, \mathbf{y}) \sim \mathcal{N}\Big(\sum_{j=1}^{K} y_j \mathbf{d}_j, \ \sigma^2 \mathbf{I}\Big) \qquad (3)$$

where the Gaussian noise has zero mean and covariance
$\sigma^2 \mathbf{I}$. A sparse prior
$p(\mathbf{y}) \propto \prod_j e^{-\lambda |y_j|}$ regularizes the
activations on the sparse code $\mathbf{y}$. Note that the
hyperparameter $\lambda$ is the same as the regularization parameter
of Equation (1). We can formulate the maximum a posteriori (MAP)
estimation problem to solve for $\{\mathbf{y}_a, \mathbf{D}_a\}$
jointly

$$\arg\max_{\mathbf{D}',\mathbf{y}'} p(\mathbf{y}' \mid \mathbf{x}, \mathbf{D}') = \arg\max_{\mathbf{D}',\mathbf{y}'} p(\mathbf{x} \mid \mathbf{D}', \mathbf{y}')\, p(\mathbf{y}'). \qquad (4)$$

Since $p(\mathbf{x} \mid \mathbf{D}', \mathbf{y}')$ is a
multivariate Gaussian density function, we can derive an analytical
solution for Equation (4). For this paper, however, we focus on
efficient estimation of the adapted dictionary and sparse code by
following an online method by Mairal et al. [16].
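
As a worked step, taking the negative logarithm of the posterior in Equation (4) shows why the MAP problem has the same form as the sparse coding objective in Equation (1). The derivation below is a sketch: normalization constants are collected into "const", and the factor $2\sigma^2$ that appears in front of the $\ell_1$ term is an artifact of writing the Gaussian explicitly, so the two regularization parameters agree up to that scaling.

```latex
\begin{aligned}
-\log\big[\,p(\mathbf{x} \mid \mathbf{D}', \mathbf{y}')\,p(\mathbf{y}')\,\big]
  &= \frac{1}{2\sigma^2}\,\lVert \mathbf{x} - \mathbf{D}'\mathbf{y}' \rVert_2^2
     + \lambda \sum_{j=1}^{K} \lvert y'_j \rvert + \text{const} \\[4pt]
\Rightarrow\quad
\arg\max_{\mathbf{D}',\mathbf{y}'} p(\mathbf{x} \mid \mathbf{D}', \mathbf{y}')\,p(\mathbf{y}')
  &= \arg\min_{\mathbf{D}',\mathbf{y}'}
     \lVert \mathbf{x} - \mathbf{D}'\mathbf{y}' \rVert_2^2
     + 2\sigma^2\lambda\,\lVert \mathbf{y}' \rVert_1 .
\end{aligned}
```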

In Algorithm 1, we present a fast online algorithm for dictionary
adaptation. This algorithm is guaranteed to converge and computes a
good estimate of $\mathbf{D}_a$ from $\mathbf{D}$ given an arbitrary
amount of utterance input. In particular, block-coordinate descent in
the inner-loop sequentially updates each basis vector (column) in the
dictionary. Since the $\mathbf{y}$ vectors are sparse, the
coefficients of the matrix $\mathbf{A}$ are concentrated on the
diagonal, making the search for optimal $\mathbf{D}_a$ very
efficient. For the sparse coding step in the inner-loop, we can use
either LARS or OMP.
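
The loop in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under explicit assumptions rather than the code used in the experiments: the sparse coding step uses a small ISTA routine as a stand-in for LARS or OMP, the column update follows the closed-form block-coordinate step from the online dictionary learning method of Mairal et al. [16], and the names `ista` and `adapt_dictionary` as well as the default iteration counts are hypothetical.

```python
import numpy as np

def ista(D, x, lam, n_iter=100):
    """Approximate LASSO code: min_y 0.5 * ||x - D y||^2 + lam * ||y||_1.

    Stand-in for LARS; note the 0.5 factor, so lam is scaled differently from Eq. (1).
    """
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    y = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = y - step * (D.T @ (D @ y - x))              # gradient step
        y = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return y

def adapt_dictionary(D, X, lam=0.15, T=200, seed=0):
    """Adapt a universal dictionary D (N x K) to the frames X (N x n) of one utterance."""
    rng = np.random.default_rng(seed)
    Da = np.array(D, dtype=float)          # D^0 := D
    K = Da.shape[1]
    A = np.zeros((K, K))                   # accumulates y y^T
    B = np.zeros_like(Da)                  # accumulates x y^T
    for _ in range(T):
        x = X[:, rng.integers(X.shape[1])]  # draw a frame uniformly at random
        y = ista(Da, x, lam)                # sparse code under the current dictionary
        A += np.outer(y, y)
        B += np.outer(x, y)
        for j in range(K):                  # block-coordinate descent over columns
            if A[j, j] < 1e-10:
                continue                    # atom not used yet; leave it unchanged
            u = Da[:, j] + (B[:, j] - Da @ A[:, j]) / A[j, j]
            Da[:, j] = u / max(np.linalg.norm(u), 1.0)  # project onto the unit ball
    return Da
```

Given the adapted dictionary, the ASC feature for a frame would be the difference between the code computed with `Da` and the code computed with `D`, for example `ista(Da, x, lam) - ista(D, x, lam)`.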

3.4.  SVM  classification 

We consider support vector machines (SVMs) for both VSC and ASC
pipelines. The kernel trick for SVM has been studied widely to cope
with cases where the input vectors for SVM are not linearly
separable. For our case, sparse coding and pooling together give a
reasonably sufficient nonlinear transformation of the SDC
coefficients. Hence, we use off-the-shelf linear SVMs only.

Algorithm 1 Online dictionary adaptation
1: require: universal sparse modeling dictionary $\mathbf{D}$ from
   unsupervised phase
2: initialize: $\mathbf{D}^0 := \mathbf{D}$, $\mathbf{A}^0 := 0$, $\mathbf{B}^0 := 0$
3: for $t := 1$ to $T$ (inner-loop)
4:   draw $\mathbf{x}$ uniformly at random from $\mathbf{X}$
5:   compute sparse code $\mathbf{y}$ for $\mathbf{x}$ using $\mathbf{D}^{t-1}$
6:   update $\mathbf{A}^t := \mathbf{A}^{t-1} + \mathbf{y}\mathbf{y}^\top$ and $\mathbf{B}^t := \mathbf{B}^{t-1} + \mathbf{x}\mathbf{y}^\top$
7:   update by block-coordinate descent
     $\mathbf{D}^t := \arg\min_{\mathbf{D}'} \big[\tfrac{1}{2}\mathrm{Tr}(\mathbf{D}'^\top \mathbf{D}' \mathbf{A}^t) - \mathrm{Tr}(\mathbf{D}'^\top \mathbf{B}^t)\big]$
8: end
9: return: $\mathbf{D}_a := \mathbf{D}^T$

Figure 3: Training 1-vs-all SVM for each language

A well-accepted strategy for an LID system is to train 1-vs-all
classifiers as explained in Figure 3. To train the model for language
$l_i \in C$, we input the pooled sparse codes for all labeled
examples from $l_i$ as class 0. For utterances from all other
languages $l_j \in C$ with $j \neq i$, we use class 1.
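
The 1-vs-all labeling described above can be sketched in a few lines of plain Python. The helper names and the language codes in the usage note are illustrative; the only convention taken from the text is that the target language is class 0 and every other language is class 1.

```python
def one_vs_all_labels(utterance_langs, target):
    """Return 0 for utterances of the target language and 1 for all others."""
    return [0 if lang == target else 1 for lang in utterance_langs]

def build_training_labels(utterance_langs, languages):
    """One binary label vector per language, keyed by language code."""
    return {lang: one_vs_all_labels(utterance_langs, lang) for lang in languages}
```

For instance, `build_training_labels(["ara-arz", "zho-yue", "ara-arz"], ["ara-arz", "zho-yue"])` yields one label list per language; each list would be paired with the same pooled feature vectors to train that language's linear SVM.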

4.  Experiments 

4.1.  Task,  dataset,  and  evaluation  metrics 

To evaluate the performance of sparse coding pipelines for LID, we
consider the NIST Language Recognition Evaluation (LRE) 2015 [17].
The task is to determine the average performance of an LID system
that can classify each language as a target within six predefined
language clusters. The language clusters are Arabic, Chinese,
English, French, Slavic, and Iberian with 20 different languages in
total. For the time being, we present a partial evaluation focusing
only on the Arabic and Chinese clusters. As summarized in Table 1,
there are 5 languages from Arabic and 4 languages from Chinese in
NIST LRE 2015.

Table 1: Arabic and Chinese language clusters from NIST LRE 2015

Cluster   Target languages
Arabic    Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard
Chinese   Cantonese, Mandarin, Min, Wu

The dataset comes in train, test, and eval subsets. We use the train
and test subsets for development. The amount of development data for
each language is uneven. It ranges from 2.6 (zho-yue) to 97.5 hours
(ara-arz) in speech duration. The eval subset serves as held-out data
to evaluate the classification performance. Following the 2015
evaluation plan [18], we adopt the NIST average cost performance as
our primary evaluation metric

$$C_{\mathrm{avg}} = \frac{1}{L}\Big\{ \Big[ C_{\mathrm{Miss}} \cdot P_{\mathrm{target}} \cdot \sum_{l_T} P_{\mathrm{Miss}}(l_T) \Big] + \frac{1}{L-1} \Big[ C_{\mathrm{FA}} \cdot (1 - P_{\mathrm{target}}) \cdot \sum_{l_T} \sum_{l_N} P_{\mathrm{FA}}(l_T, l_N) \Big] \Big\}.$$

In  addition,  we  compute  the  classification  accuracy  metrics  for 
the  nine  1-vs-all  linear  SVMs  on  VSC  and  ASC. 
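
As a rough illustration of how the average cost is assembled, the sketch below computes it from per-language miss rates and pairwise false-alarm rates. It assumes the usual form of the metric: a miss term summed over target languages, plus a false-alarm term summed over target/non-target pairs and divided by L - 1, all averaged over L languages. The function name `c_avg` and the default values C_Miss = C_FA = 1 and P_target = 0.5 are assumptions made for this example.

```python
def c_avg(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Average cost over L >= 2 target languages.

    p_miss[t]  : miss probability when t is the target language.
    p_fa[t][n] : false-alarm probability for target t on non-target language n.
    """
    langs = list(p_miss)
    num_langs = len(langs)
    miss_term = c_miss * p_target * sum(p_miss[t] for t in langs)
    fa_sum = sum(p_fa[t][n] for t in langs for n in langs if n != t)
    fa_term = c_fa * (1.0 - p_target) * fa_sum / (num_langs - 1)
    return (miss_term + fa_term) / num_langs
```

Lower values are better: a system that never misses and never raises a false alarm scores 0.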

4.2.  Methods  and  training 

For comparative performance evaluation, we have trained a benchmark
pipeline that takes in SDC feature vectors in an i-vector framework
[4], which we call “IVEC-benchmark.” IVEC-benchmark also uses a
7-1-3-7 SDC scheme along with static cepstra for the same
56-dimensional vector input to GMM-UBM and i-vector extraction. For
years, i-vector based systems have been able to produce
state-of-the-art results in speaker and language recognition tasks.
IVEC-benchmark remains a part of MIT Lincoln Laboratory’s NIST LRE
2015 submission [19].

Before sparse coding, we normalize each input vector by removing its
mean and dividing by the standard deviation. The normalized input
vectors are then ZCA-whitened [20]. Empirically, we choose
ZCA-whitening over PCA-whitening, and there is no dimensionality
reduction.
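
The preprocessing just described can be sketched with NumPy as below. This is an illustrative version, not the exact code used in the experiments: the function names, the row-wise data layout, and the small regularizers `eps` are assumptions. The regularizer matters in practice because removing each vector's own mean makes the covariance rank-deficient.

```python
import numpy as np

def normalize_rows(X, eps=1e-8):
    """Remove each input vector's mean and divide by its standard deviation."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / (X.std(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (n examples x N dimensions).

    Returns the whitened data and the symmetric whitening matrix W.
    No dimensions are dropped, so the output keeps all N dimensions.
    """
    Xc = X - X.mean(axis=0)                      # center each dimension
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov = V diag(eigvals) V^T
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, W
```

Unlike PCA-whitening, the extra rotation back by the eigenvector matrix keeps the whitened features as close as possible to the original coordinate axes.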

We have tested multiple configurations of VSC and ASC by varying the
choice of sparse coding algorithm, $\ell_1$-regularized LARS or
greedy-$\ell_0$ OMP, and the number of basis vectors in a dictionary
$K = 512, 1024$. For example, ASC-LARS-1024 denotes adaptive sparse
coding with LARS and a 1,024-basis vector dictionary. Throughout our
experiments with LARS, we use a sparsity penalty $\lambda = 0.15$.
For OMP, we use a sparsity bound $S = 0.1 \times 56 \approx 6$.

During the unsupervised phase, we use the train subset to train
$\mathbf{D}$ for universal sparse modeling of all 9 languages. During
the supervised phase, we partition test into five folds and do
cross-validation to determine hyperparameters of the SVMs. For ASC,
we also adapt $\mathbf{D}$ to utterance-specific dictionaries with
the test subset during the supervised phase. We have tested average
and max pooling methods described below.

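1. Average: $f(\{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\}) = \frac{1}{M} \sum_{j=1}^{M} |\mathbf{y}^{(j)}|$

2. Max: $f(\{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\}) = \max_{\forall k}\big(\{|y_k^{(1)}|, \dots, |y_k^{(M)}|\}\big)$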
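
Both pooling operators reduce the M sparse codes of one utterance to a single K-dimensional vector. A minimal NumPy sketch follows, assuming the codes are stacked as the rows of an M x K array and that average pooling divides the sum of absolute values by M; the function names are illustrative.

```python
import numpy as np

def average_pool(codes):
    """Mean of absolute values over the M codes (rows) of one utterance."""
    return np.abs(np.asarray(codes, dtype=float)).mean(axis=0)

def max_pool(codes):
    """Element-wise maximum of absolute values over the M codes of one utterance."""
    return np.abs(np.asarray(codes, dtype=float)).max(axis=0)
```

For ASC, the same functions would be applied to the per-frame difference vectors instead of the raw sparse codes.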


4.3.  Fusion 

As with the historical NIST LREs, if running multiple pipelines
concurrently is found beneficial, we should consider postprocessing
at the backend that consists of per-pipeline calibration and fusion.
For example, we can do a simple linear fusion of IVEC-benchmark and
one of our sparse coding pipelines:

$$\mathrm{llr}_{\mathrm{fusion}} = p \cdot \big[\mathrm{llr}_1^{l_T} - \mu_1^{l_T}\big] + (1 - p) \cdot \big[\mathrm{llr}_2^{l_T} - \mu_2^{l_T}\big] \qquad (5)$$

Here, we use a mixing ratio $p$ to combine the scores (i.e.,
log-likelihood ratios $\mathrm{llr}_1$ and $\mathrm{llr}_2$ with
respect to a target language $l_T$) from the two pipelines. We can
also think of more sophisticated fusion schemes based on logistic
regression and neural networks.
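
A simple linear fusion of two pipeline scores can be sketched as below. This is an illustration only: the function name `fuse_scores` is hypothetical, and the optional per-pipeline offsets (subtracted before mixing, defaulting to zero) are an assumption about how each score is calibrated rather than something the text specifies.

```python
def fuse_scores(llr1, llr2, p, offset1=0.0, offset2=0.0):
    """Mix two log-likelihood ratios for the same target language with ratio p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("mixing ratio p must lie in [0, 1]")
    return p * (llr1 - offset1) + (1.0 - p) * (llr2 - offset2)
```

In practice the mixing ratio would be chosen on development data, for example by sweeping p over a grid and keeping the value with the lowest average cost.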


Table 2: Comparison of average cost performance ($C_{\mathrm{avg}}$)
on Arabic and Chinese language clusters

Pipeline                     Arabic   Chinese
IVEC-benchmark               0.2566   0.2054
IVEC-benchmark (GMM-SAD)     0.2539   0.2231
VSC-LARS-512                 0.2615   0.2556
VSC-OMP-512                  0.2823   0.2699
VSC-LARS-1024                0.2393   0.2043
VSC-OMP-1024                 0.2486   0.2120
ASC-LARS-512                 0.2187   0.1909
ASC-OMP-512                  0.2342   0.2036
ASC-LARS-1024                0.1874   0.1634
ASC-OMP-1024                 0.2015   0.1983

4.4.  Results  and  discussion 

Table 2 presents the performance comparison on the average cost
metric $C_{\mathrm{avg}}$ for IVEC-benchmark, as well as the proposed
VSC and ASC pipelines. These results are obtained by running the eval
subset, which includes 34,530 utterances for Arabic and 44,596
utterances for Chinese. The bold-faced numbers represent the best
result from each of the IVEC, VSC, and ASC groups. Notice we include
the result for IVEC-benchmark under GMM-SAD, which performs better
for Arabic.

We observe that ASC makes a significant improvement over VSC. If the
choice of sparse coding algorithm and the number of basis vectors in
a dictionary $K$ are the same, ASC results in consistently better
cost performance. Overcompleteness of the sparse coding dictionary is
an important hyperparameter to preconfigure. For both VSC and ASC,
increasing $K$ from 512 to 1,024 has always improved the cost
performance. Also for both pipelines, LARS results in better
performance. However, the computation time for LARS is found to be an
order of magnitude higher than that of OMP.

5.  Conclusion 

Sparse coding has achieved state-of-the-art performance in computer
vision and object recognition. Despite growing interest, there is
relatively little work in sparse coding for acoustic language
modeling. In this paper, we have described semi-supervised approaches
for sparse coding on the task of language recognition. Unlike the
existing sparse representation classification (SRC), we propose MAP
adaptation on the dictionary for sparse modeling of all languages to
improve the discriminative quality of sparse-coded speech features.
Using the NIST LRE 2015 dataset, we empirically evaluate the
effectiveness of our approaches. Also, our experimental backend
results indicate that sparse coding, ASC in particular, should be a
viable component for a top LID system.